Part 2: Exploring multidimensional and hierarchical data with interaction 🔍¶

In [3]:
import plotly.express as px 
import pandas as pd
import plotly.offline as plto
plto.init_notebook_mode()
# also needs: conda install -c conda-forge nbformat 
# or pip equivalent, see here: https://stackoverflow.com/questions/66557543/valueerror-mime-type-rendering-requires-nbformat-4-2-0-but-it-is-not-installed

Plotly (express) provides basic chart types for multidimensional data with interaction "out of the box". Here is an example from the documentation. Try out brushing with the mouse along the different axes!

👉 TODO 2.1: Choose a multidimensional dataset and explore it by creating interactive visualizations with Plotly. In your exploration, make use of at least one chart type for multidimensional data. See the Plotly overview of chart types here.

At the end of your exploration, write a short summary that reflects on the interactions you used and how they impacted your exploration (in addition to the reflections per chart, as before). For example, you could mention if they helped you to identify a specific pattern or gain a specific insight (or not).

Using the titanic dataset¶

In [4]:
data_folder = './data/'
df_titanic = pd.read_csv(data_folder+'tested.csv')
df_titanic.head(10)
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 0 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 1 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 0 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 0 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 1 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
5 897 0 3 Svensson, Mr. Johan Cervin male 14.0 0 0 7538 9.2250 NaN S
6 898 1 3 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q
7 899 0 2 Caldwell, Mr. Albert Francis male 26.0 1 1 248738 29.0000 NaN S
8 900 1 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female 18.0 0 0 2657 7.2292 NaN C
9 901 0 3 Davies, Mr. John Samuel male 21.0 2 0 A/4 48871 24.1500 NaN S

How many people out of the total survived?¶

In [5]:
df_titanic['Survived'].value_counts()
Out[5]:
Survived
0    266
1    152
Name: count, dtype: int64

In this dataset, 152 of the 418 passengers survived.
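
To put the count in proportion, the share of survivors can be derived from the two counts in Out[5]. This is a minimal sketch that rebuilds those counts by hand, so it runs without the CSV file; the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from Out[5]: 0 = did not survive, 1 = survived.
counts = pd.Series({0: 266, 1: 152}, name="count")

# Dividing by the total turns the counts into shares of all passengers.
shares = counts / counts.sum()
print(shares.round(3))
```

On the full frame, `df_titanic['Survived'].value_counts(normalize=True)` should give the same shares directly.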

In [6]:
df_titanic['Age'].isnull().any()
Out[6]:
True
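
`.any()` only tells us that at least one age is missing. A small sketch of how the number and share of missing values could be counted follows; the `ages` Series is made-up illustration data, not taken from the CSV.

```python
import pandas as pd

# Hypothetical ages with two missing entries (None becomes NaN).
ages = pd.Series([34.5, None, 62.0, 27.0, None], name="Age")

n_missing = int(ages.isnull().sum())        # number of missing values
share_missing = float(ages.isnull().mean()) # fraction of missing values
print(n_missing, share_missing)
```

The same two calls applied to `df_titanic['Age']` would show how many passengers have no recorded age before plotting.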

Plotting passengers of various ages against their ticket class¶

In [7]:
fig_3d = px.scatter_3d(df_titanic, x='Age', y='Pclass', z='PassengerId', color='Age')
fig_3d.show()

From the plot, we see that most passengers are between 20 and 50 years old. A few of these could be families with children, since we also see some blue dots. There are very few senior citizens compared to the majority of working-age passengers.

The passengers seem to be evenly distributed across the three available ticket classes. We see more senior citizens in first class than in the other two classes. It would be valuable to see how this graph varies for male and female passengers.

In [8]:
columns = ['Sex' ,'Pclass','Age','Survived','Fare']
df_parallel = df_titanic[columns]
fig2 = px.parallel_categories(df_parallel, color='Survived')
fig2.show()

We see that a large share of the 3rd-class passengers were male. In 2nd class too, males outnumber females. However, among the 1st-class passengers, there seem to be almost as many females as males.

In this dataset, only the females survived, and all of the females survived.
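
A claim like this can be cross-checked with a contingency table. The sketch below uses only the ten rows shown in Out[4] (Sex and Survived columns), so it illustrates the check rather than confirming the result for the whole file.

```python
import pandas as pd

# Sex and Survived from the first ten rows displayed in Out[4].
sample = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female",
            "male", "female", "male", "female", "male"],
    "Survived": [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})

# Rows: Sex, columns: Survived, cells: number of passengers.
table = pd.crosstab(sample["Sex"], sample["Survived"])
print(table)
```

Running `pd.crosstab(df_titanic['Sex'], df_titanic['Survived'])` on the full frame would show whether the off-diagonal cells are really zero.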

In [9]:
columns = ['Pclass' ,'PassengerId','Age','Fare','Sex']
df_Fare = df_titanic[columns]
fig2 = px.scatter_3d(df_Fare ,x = 'Sex',y ='PassengerId',z = 'Fare', color='Pclass')
fig2.show()
In [10]:
fig2 = px.scatter_3d(df_Fare ,x = 'Sex',y ='Age',z = 'Fare', color='Pclass')
fig2.show()

Looking at the Fare, there is not a big difference between 1st- and 2nd-class tickets. One possible explanation is that there were early-bird offers, with tickets sold more cheaply in the beginning.

Parallel categories plots work best for categorical data. By default, px.parallel_categories leaves out columns with more than 50 distinct values (the dimensions_max_cardinality parameter), which is why Age and Fare do not appear as axes in the plot above.

3D scatter plots are quite good for visualizing trends in numeric data. They can also show categorical data, but from any single viewpoint the points overlap. Choosing axes that avoid this overlap takes some experience or analysis.
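
To see which columns a parallel categories plot is likely to draw, one can count distinct values per column. The helper below is a hypothetical imitation of the documented default of `dimensions_max_cardinality=50`, not Plotly's own code, and the `toy` frame is made-up data. It also sketches how binning with `pd.cut` could bring a numeric column such as Age back into the plot.

```python
import pandas as pd

def low_cardinality_columns(df, max_cardinality=50):
    """Return the columns with at most `max_cardinality` distinct values."""
    n_unique = df.nunique()
    return list(n_unique[n_unique <= max_cardinality].index)

# Made-up data: 120 passengers with 60 distinct ages.
toy = pd.DataFrame({
    "Sex": ["male", "female"] * 60,
    "Pclass": [1, 2, 3] * 40,
    "Age": [float(a) for a in range(1, 61)] * 2,
})
kept_before = low_cardinality_columns(toy)
print(kept_before)  # Age has 60 distinct values, so it is filtered out

# Binning turns Age into a handful of categories that pass the filter.
toy["AgeGroup"] = pd.cut(toy["Age"], bins=[0, 18, 40, 60],
                         labels=["0-18", "19-40", "41-60"])
kept_after = low_cardinality_columns(toy)
print(kept_after)
```

A binned column added to `df_parallel` in the same way would then be expected to show up as an additional axis in `px.parallel_categories`.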


👉 TODO 2.2: Choose a hierarchical dataset and explore it with Plotly. In your exploration, make use of at least one chart type for hierarchical data. Depending on your dataset, we recommend using a treemap, a sunburst chart, or a tree plot.

At the end of your exploration, write a short summary that reflects on the interactions you used and how they impacted your exploration (in addition to the reflections per chart, as before). For example, you could mention if they helped you to identify a specific pattern or gain a specific insight (or not).

In [11]:
data_folder = './data/'
titanic = pd.read_csv(data_folder+'train.csv')
titanic.head(10)
Out[11]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
In [17]:
px.sunburst(titanic,path=["Sex","Pclass"],color="Fare",template="plotly_dark")

In the sunburst plot, the path argument lists the columns we want to display. The number of columns equals the number of concentric rings, and the order matters: the first column forms the innermost ring. In the plot, we can see that the female survival rate is 40%, with most of the surviving females coming from 1st class, followed by 2nd- and 3rd-class females with almost similar survival rates of about 20%. The male survival rate is 30%, of which about 60% are 1st-class passengers and very few are 2nd- and 3rd-class passengers.
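
The hierarchy that `path` describes corresponds to a group-by over the same columns, which gives a way to read exact numbers instead of estimating them from the chart. The sketch below uses only the ten rows shown in Out[11], so the rates it prints describe that small sample and not the full dataset.

```python
import pandas as pd

# Sex, Pclass and Survived from the first ten rows displayed in Out[11].
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female", "female"],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

path = ["Sex", "Pclass"]  # inner ring first, outer ring second

# One row per outer-ring sector: passenger count and mean of Survived.
summary = sample.groupby(path)["Survived"].agg(
    passengers="size", survival_rate="mean"
)
print(summary)
```

Applying the same group-by to the full `titanic` frame would yield the survival rate per sex and class for comparison with the figures quoted above.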

In [18]:
fig = px.sunburst(titanic, path=['Survived', 'Sex', 'Pclass'], color='Survived')
fig.show()
In [22]:
fig = px.sunburst(titanic, path=['Sex', 'Pclass'], values='Fare', color = 'Fare')

# Update the layout of the chart
fig.update_layout(title='Titanic: Fare Distribution by Sex and Class')

# Display the chart
fig.show()

As expected, first-class tickets are the most expensive for both sexes. 2nd- and 3rd-class fares do not seem to differ much. Ticket prices appear to depend on class rather than on sex. The female 1st-class sector is larger than the male one; because the sectors are sized by values='Fare', this reflects a higher summed fare and not necessarily a larger number of passengers. Overall, however, there are more male than female passengers.
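
Because `values='Fare'` sizes each sector by the summed fare, it can help to look at passenger counts and fares side by side. This is a minimal sketch on the ten rows shown in Out[11]; the column names of the result (`passengers`, `total_fare`, `median_fare`) are chosen here for illustration.

```python
import pandas as pd

# Sex, Pclass and Fare from the first ten rows displayed in Out[11].
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female", "female"],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05,
             8.4583, 51.8625, 21.075, 11.1333, 30.0708],
})

# Count, summed fare (sector size in the chart) and median fare per group.
fares = sample.groupby(["Sex", "Pclass"])["Fare"].agg(
    passengers="size", total_fare="sum", median_fare="median"
)
print(fares.round(2))
```

Comparing `passengers` with `total_fare` on the full `titanic` frame would show whether a large sector comes from many passengers or from a few expensive tickets.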